DeepEyes is a vision-language model trained with reinforcement learning to 'think with images': it integrates visual information directly into its reasoning chain and performs well on image-text tasks.
Text-to-Image
Transformers English